-
Notifications
You must be signed in to change notification settings - Fork 102
Add a User-Defined Table Function example #138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
69e3edd to
12a7010
Compare
- Add k-means clustering UDTF example for Unity Catalog - Focus documentation on UDTF pattern and SQL accessibility - Include Python implementation and SQL usage examples - Add CI/CD integration instructions
ee6874e to
986dd79
Compare
| import csv | ||
| import os | ||
| except ImportError: | ||
| raise ImportError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not great for debugging to lose original stack trace. Why not add the message but keep the stacktrace?
try:
...
except:
print(..., file.sys.stderr)
raise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from the default-python template though :o The idea is to offer a very short message without a long stack trace to humans when they type pytest instead of uv run pytest
| register_udtf_job: | ||
| name: register_udtf_job | ||
| schedule: | ||
| quartz_cron_expression: '0 0 8 * * ?' # daily at 8am |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q - why do you need to register the same udtf daily?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related q - how to unregister? is it just removing this job and there is some clean up process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Daily register: It would be much better if we had a "job run" resource that could run the job post-deploy. Running it daily is a compromise. The README talks more about this: customer can leave this compromise in place or they can extend their CI scripts to run the job after deploy.
Unregister: maybe you have ideas how to do that with a "job run" resource in place? Maybe a parameter to the job that runs it before it is destroyed? Could that work?
| if not self.rows: | ||
| return | ||
|
|
||
| import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment why imports are here and not at the top level? Is it so that registration works without these libraries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is not essential, let me change this, good callout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I was wrong. Agent added a comment after I told them
hey in #138 there is a comment about inline imports. can you
look at that comment. fix the issue. then verify things still work: the tests would need to run, the job would
need to be deployable + successfully run end-to-end. ultrathinkif you can do all this. then rate how well you did at making sure everything works well. if it's >8 out of 10,
then commit and push!
| def register(catalog: str, schema: str, name: str = "k_means"): | ||
| """Register k_means UDTF in Unity Catalog""" | ||
| spark = SparkSession.builder.getOrCreate() | ||
| source = inspect.getsource(SklearnKMeans) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this function gets access to dependencies? Does it rely on sklearn being pre-provisioned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, yes, there apparently is a gap in UDTFs in specifying dependencies. That gap will be fixed.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
This adds an example for defining a User-Defined Table Function (UDTF) in Unity Catalog.
Highlighted files:
src/kmeans_udtf.py: registers the UDTFtests/test_kmeans_udtf.py: unit test for registration and usage (run withuv run pytest)resources/register_udtf_job.yml: job that registers the UDTF